
Mapping Big Data with Dask and Datashader

For a long time, plotting large quantities of data in a Python notebook wasn't exactly fun. Classical plotting packages, such as matplotlib, seaborn, plotly and bokeh, were not able to render large quantities of data points - especially if we are talking about interactive visualisation.

Recently, however, a few new technologies have started gaining attention. In particular, I am talking about JIT (just-in-time compilation) and the corresponding package for Python - Numba. Numba can dramatically speed up standard numerical computations in Python; in fact, traversing a plain Python loop with Numba is often even faster than using numpy!
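
To make this concrete, here is a minimal sketch (my own illustration, not part of the original notebook) of how Numba JIT-compiles a plain Python loop:

import numpy as np
from numba import jit

@jit(nopython=True)
def loop_sum(a):
    # a plain Python loop, compiled to machine code on the first call
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i]
    return total

a = np.random.rand(10000000)
loop_sum(a)  # first call compiles; later calls run at C-like speed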

On the other hand, Dask has arrived - a sort of lightweight multiprocessing wrapper around numpy and pandas that helps to work with large (larger-than-RAM) datasets.
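
For a taste of the idea (a toy sketch of my own, not from the notebook): dask splits a dataframe into pandas partitions, builds a lazy task graph over them, and only materialises the result on .compute():

import pandas as pd
from dask import dataframe as dd

pdf = pd.DataFrame({'a': range(10)})
ddf = dd.from_pandas(pdf, npartitions=2)  # two pandas chunks under the hood
ddf.a.sum().compute()                     # the graph is executed only here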

Mixing the two, the Bokeh team introduced datashader - an awesome new plotting library that aggregates large datasets and renders them into an image or an interactive visualisation. Datashader is super awesome (yet very young, and its API still changes).

In this notebook I will show how to plot a large dataset (36 million points, in this case) on a single machine using two new libraries - dask and datashader.

First, of course, we need to import all the modules we will use.

In [5]:
%matplotlib inline
import pylab as plt
In [6]:
from ipynotifyer import notifyOnComplete as nf  # notification helper for long-running cells
In [7]:
import numpy as np
import pandas as pd
In [8]:
import datashader as ds
import datashader.transfer_functions as tf
In [9]:
from dask import dataframe as dd
import dask
In [10]:
from functools import partial

from datashader.utils import export_image
from datashader.colors import colormap_select, Greys9, Hot, viridis, inferno
from IPython.core.display import HTML, display
In [11]:
from pyproj import Proj # reproject points to State Plane

nyc = Proj(init='epsg:2263')

def reproj(df, prj=nyc):
    # project lon/lat (degrees) into State Plane x/y (feet)
    x, y = prj(df['lon'].values, df['lat'].values)
    df['x'], df['y'] = x, y
    return df
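
A quick sanity check on a toy frame (illustrative coordinates, not from the dataset):

test = pd.DataFrame({'lon': [-73.985], 'lat': [40.748]})
reproj(test)  # adds x and y columns in State Plane feet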

Get the data

Now, let's get the dataset loaded.

In [12]:
dsk = dd.read_csv('data/data*.csv', encoding='utf8')

Let's count the rows:

In [8]:
len(dsk) # size of the dataset
/Users/casy/anaconda/lib/python2.7/multiprocessing/pool.py:113: DtypeWarning: Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.
  result = (True, func(*args, **kwds))
Out[8]:
35998001
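
As an aside, the DtypeWarning above can be silenced by declaring dtypes up front. A sketch - assuming the mixed-type column (the warning only reports its index, 3) is the application column:

dsk = dd.read_csv('data/data*.csv', encoding='utf8',
                  dtype={'application': str})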

Process the data

  • lowercase the application column
In [13]:
dsk = dsk.assign(application=dsk.application.str.lower())
  • reproject to NYC state plane
In [14]:
dsk = dsk.map_partitions(reproj)
  • add time of day in seconds (Unix timestamp modulo 86,400, the number of seconds in a day)
In [15]:
dsk = dsk.assign(daytime=dsk.timestamp.mod(86400))  # 86,400 seconds in a day

Now let's play with dask's graph visualisation, just because it is awesome. As we can see, the data is split into many "chunks", and a set of transformations is performed on each of them (all operations are row-wise for now).

In [16]:
dsk.visualize()
Out[16]:

And now let's actually compute the result.

In [17]:
d = dsk.compute()
/Users/casy/anaconda/lib/python2.7/multiprocessing/pool.py:113: DtypeWarning: Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.
  result = (True, func(*args, **kwds))

Visualisation

Now let's prepare to visualise our map using datashader.

First, let's define a canvas size

In [19]:
plot_width  = int(1000)
plot_height = plot_width

background = "black"

The datashader examples propose using the partial helper, so that we don't have to define the background style every time:

In [20]:
export = partial(export_image, background = background)
cm = partial(colormap_select, reverse=(background!="black"))

Also, we want the notebook to use the full browser width:

In [21]:
display(HTML("<style>.container { width:100% !important; }</style>"))

Now let's define the data-side canvas coordinates; we can simply reproject them from lon/lat as well:

In [22]:
sw = nyc( -74.15, 40.463661  ) # reproj
ne = nyc( -73.66, 40.947435  ) # reproj

NYC = x_range, y_range = zip(sw, ne)

cvs = ds.Canvas(plot_width, plot_height, *NYC)

Density

First, let's just count tweets at each pixel.

In [23]:
count = cvs.points(d, 'x', 'y')

Let's start with linear color interpolation. That means the difference in color and/or brightness between two pixels is linearly proportional to the difference in their underlying values. Most of the time this is a bad idea, as a few hot spots will overwhelm the general population. Still, let's give it a try.

In [24]:
export(tf.interpolate(count, cmap = Greys9, how='linear'),'tweets_density_linear')
Out[24]:

As we expected, it really doesn't help, so let's switch to histogram equalization (eq_hist) instead. Histogram equalization means that, for each color in the colormap, the buckets are adjusted so that each color represents an equal number of points.
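
To make the idea concrete, here is a rough numpy sketch (my own illustration, not datashader's actual implementation) of quantile-based bucketing:

import numpy as np

def eq_hist_edges(values, n_colors):
    # quantile-based bin edges: each color bucket receives
    # (approximately) the same number of points
    return np.percentile(values, np.linspace(0, 100, n_colors + 1))

vals = np.random.lognormal(size=10000)      # skewed, like tweet counts
edges = eq_hist_edges(vals, 9)              # e.g. the 9 colors of Greys9
color_idx = np.digitize(vals, edges[1:-1])  # color index (0..8) per value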

In [25]:
export(tf.interpolate(count, cmap = Greys9, how='eq_hist'),'tweets3')
Out[25]:

Now, grey is kind of boring, so let's change the color scheme. It's worth noticing that we don't do any heavy computation here - all the counting was already done in the .points() call. All we are doing now is rendering a 2d matrix.
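
For context, the aggregate returned by cvs.points() is an xarray DataArray holding the per-pixel counts, so re-coloring only touches this in-memory grid:

count.shape         # (plot_height, plot_width)
count.values.max()  # the count of the busiest pixel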

In [27]:
export(tf.interpolate(count, cmap=viridis, how='eq_hist'), 'colored_total')
Out[27]:

Applications

Now, let's determine which of the top-4 applications is the most popular at each point.

I actually started by defining the colors. A strange thing to start with, but this way I can use the dictionary keys to filter the apps later.

In [28]:
if background == "black":
    color_key = {'foursquare': 'aqua',
                 'twitter for iphone': 'white',
                 'instagram': 'red',
                 'twitter for android': 'lime'}
else:
    color_key = {'foursquare': 'blue',
                 'twitter for iphone': 'white',
                 'instagram': 'red',
                 'twitter for android': 'lime'}

Filter the data for the top-4 applications, just as with pandas:

In [29]:
appDf = d[d.application.isin(color_key.keys())]

Now, let's convert application to a categorical type, which datashader's count_cat aggregation requires:

In [30]:
appDf = appDf.assign(application=appDf.application.astype('category'))
In [48]:
appDf.application.value_counts()
Out[48]:
twitter for iphone     20081109
instagram               5575630
twitter for android     4823926
foursquare              2506839
Name: application, dtype: int64

Now count by category

In [31]:
appCount = cvs.points(appDf, 'x', 'y', ds.count_cat('application'))

And plot

In [32]:
export(tf.colorize(appCount, color_key, how='eq_hist'), 'colored_apps')
Out[32]:

Daytime

Now, let's visualise the time of day. Here I use the "hsv" colormap, because it is cyclic: I want the values for 00:05 and 23:55 to end up close to each other.

Also, I remove noise (pixels with fewer than 10 tweets), using the count aggregate we have already computed.

In [33]:
threshold = 10
In [34]:
aggDaytime = cvs.points(d, 'x', 'y', agg=ds.mean('daytime'))
In [35]:
colormap = plt.get_cmap('hsv')
export(tf.interpolate(aggDaytime.where(count > threshold), cmap=colormap, how='eq_hist'), 'colored_daytime')
Out[35]:

Datashader is incredible! Next time I will play with its interactive part.

Feel free to ask or suggest anything via casyfill@gmail.com.
